Computation and Language 33
☆ Auditing Gender Presentation Differences in Text-to-Image Models
Text-to-image models, which can generate high-quality images based on textual
input, have recently enabled various content-creation tools. Despite
significantly affecting a wide range of downstream applications, the
distributions of these generated images are still not fully understood,
especially when it comes to the potential stereotypical attributes of different
genders. In this work, we propose a paradigm (Gender Presentation Differences)
that utilizes fine-grained self-presentation attributes to study how gender is
presented differently in text-to-image models. By probing gender indicators in
the input text (e.g., "a woman" or "a man"), we quantify the frequency
differences of presentation-centric attributes (e.g., "a shirt" and "a dress")
through human annotation and introduce a novel metric: GEP. Furthermore, we
propose an automatic method to estimate such differences. The automatic GEP
metric based on our approach yields a higher correlation with human annotations
than that based on existing CLIP scores, consistently across three
state-of-the-art text-to-image models. Finally, we demonstrate the
generalization ability of our metrics in the context of gender stereotypes
related to occupations.
comment: Preprint, 23 pages, 14 figures
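The GEP metric itself is computed from human annotations of generated images; as a toy illustration of the underlying quantity (frequency differences of presentation-centric attributes across gendered prompts), a minimal sketch might look like the following. The attribute lists, the mean-absolute-difference aggregation, and all names are illustrative assumptions, not the paper's exact definition:

```python
from typing import Dict, List

def attribute_frequencies(annotations: List[List[str]],
                          attributes: List[str]) -> Dict[str, float]:
    """Fraction of images in which each presentation attribute was annotated."""
    n = len(annotations)
    return {a: sum(a in img for img in annotations) / n for a in attributes}

def gep_difference(woman_imgs, man_imgs, attributes):
    """Hypothetical frequency-difference score in the spirit of GEP:
    mean absolute difference of attribute frequencies between the two
    prompt groups (the paper's actual metric may aggregate differently)."""
    f_w = attribute_frequencies(woman_imgs, attributes)
    f_m = attribute_frequencies(man_imgs, attributes)
    return sum(abs(f_w[a] - f_m[a]) for a in attributes) / len(attributes)

# Toy annotations: attributes observed in each generated image.
woman = [["dress"], ["dress", "shirt"], ["dress"], ["shirt"]]
man = [["shirt"], ["shirt"], ["shirt"], ["suit"]]
score = gep_difference(woman, man, ["dress", "shirt", "suit"])
```

A larger score indicates a larger presentation gap between the two prompt groups.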
☆ Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
The strength of modern generative models lies in their ability to be
controlled through text-based prompts. Typical "hard" prompts are made from
interpretable words and tokens, and must be hand-crafted by humans. There are
also "soft" prompts, which consist of continuous feature vectors. These can be
discovered using powerful optimization methods, but they cannot be easily
interpreted, re-used across models, or plugged into a text-based interface.
We describe an approach to robustly optimize hard text prompts through
efficient gradient-based optimization. Our approach automatically generates
hard text-based prompts for both text-to-image and text-to-text applications.
In the text-to-image setting, the method creates hard prompts for diffusion
models, allowing API users to easily generate, discover, and mix and match
image concepts without prior knowledge on how to prompt the model. In the
text-to-text setting, we show that hard prompts can be automatically discovered
that are effective in tuning LMs for classification.
comment: 14 pages, 10 figures, Code is available at
\url{https://github.com/YuxinWenRick/hard-prompts-made-easy}
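The gradient-based discrete optimization can be illustrated with a deterministic toy: take gradient steps on a continuous embedding whose forward pass goes through its nearest-token projection, in the spirit of the paper's approach. The line-shaped vocabulary, squared-error loss, and learning rate below are illustrative assumptions:

```python
import numpy as np

# Toy vocabulary of token embeddings laid out on a line, and a target
# embedding the prompt should match (a stand-in for, e.g., a CLIP target).
vocab = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
target = vocab[3]

soft = np.array([0.2, 0.0])   # continuous ("soft") prompt embedding
lr = 0.1
for _ in range(50):
    # Forward pass uses the nearest-token ("hard") projection of the soft
    # embedding; the gradient of ||hard - target||^2 w.r.t. the projected
    # embedding is then applied to the soft copy.
    hard = vocab[int(np.argmin(((vocab - soft) ** 2).sum(axis=1)))]
    grad = 2.0 * (hard - target)
    soft = soft - lr * grad

# Discretize the optimized soft prompt back to a token id.
token_id = int(np.argmin(((vocab - soft) ** 2).sum(axis=1)))
```

Once the projection lands on the target token, the gradient vanishes and the soft embedding stops moving, so the recovered hard prompt is stable.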
☆ Exploitation and exploration in text evolution. Quantifying planning and translation flows during writing
Writing is a complex process at the center of much of modern human activity.
Although it appears to be a linear process, writing conceals many highly
non-linear processes. Previous research has focused on three phases of writing:
planning, translation and transcription, and revision. While research has shown
these are non-linear, they are often treated linearly when measured. Here, we
introduce measures to detect and quantify subcycles of planning (exploration)
and translation (exploitation) during the writing process. We apply these to a
novel dataset that recorded the creation of a text in all its phases, from
early attempts to the finishing touches on a final version. This dataset comes
from a series of writing workshops in which, through innovative versioning
software, we were able to record all the steps in the construction of a text.
More than 60 junior researchers in science wrote a scientific essay intended
for a general readership. We recorded each essay as a writing cloud, defined as
a complex topological structure capturing the history of the essay itself.
Through this unique dataset of writing clouds, we expose a representation of
the writing process that quantifies its complexity and the writer's efforts
throughout the draft and through time. Interestingly, this representation
highlights the phases of "translation flow", where authors improve existing
ideas, and exploration, where creative deviations appear as the writer returns
to the planning phase. These turning points between translation and exploration
become rarer as the writing process progresses and the author approaches the
final version. Our results and the new measures introduced have the potential
to foster the discussion about the non-linear nature of writing and support the
development of tools that can support more creative and impactful writing
processes.
☆ CALaMo: a Constructionist Assessment of Language Models
This paper presents a novel framework for evaluating Neural Language Models'
linguistic abilities using a constructionist approach. Not only is the
usage-based model in line with the underlying stochastic philosophy of neural
architectures, but it also allows the linguist to keep meaning as a determinant
factor in the analysis. We outline the framework and present two possible
scenarios for its application.
☆ Efficiently Upgrading Multilingual Machine Translation Models to Support More Languages EACL 2023
With multilingual machine translation (MMT) models continuing to grow in size
and number of supported languages, it is natural to reuse and upgrade existing
models to save computation as data becomes available in more languages.
However, adding new languages requires updating the vocabulary, which
complicates the reuse of embeddings. The question of how to reuse existing
models while also making architectural changes to provide capacity for both old
and new languages has also not been closely studied. In this work, we introduce
three techniques that help speed up effective learning of the new languages and
alleviate catastrophic forgetting despite vocabulary and architecture
mismatches. Our results show that by (1) carefully initializing the network,
(2) applying learning rate scaling, and (3) performing data up-sampling, it is
possible to exceed the performance of a same-sized baseline model with 30%
computation and recover the performance of a larger model trained from scratch
with over 50% reduction in computation. Furthermore, our analysis reveals that
the introduced techniques help learn the new directions more effectively and
alleviate catastrophic forgetting at the same time. We hope our work will guide
research into more efficient approaches to growing languages for these MMT
models and ultimately maximize the reuse of existing models.
comment: Accepted to EACL 2023 (Main)
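One plausible reading of "carefully initializing the network" when the vocabulary grows is to reuse embedding rows for tokens shared with the old vocabulary and give new tokens a neutral starting point. The sketch below initializes new rows at the mean of the old embeddings, which is an assumption for illustration, not necessarily the paper's scheme:

```python
import numpy as np

def expand_embeddings(old_emb: np.ndarray, old_vocab: list,
                      new_vocab: list) -> np.ndarray:
    """Grow an embedding table to a new vocabulary: copy rows for shared
    tokens, initialize unseen tokens at the mean of the old embeddings
    (one plausible choice among several)."""
    dim = old_emb.shape[1]
    index = {tok: i for i, tok in enumerate(old_vocab)}
    mean = old_emb.mean(axis=0)
    new_emb = np.empty((len(new_vocab), dim))
    for j, tok in enumerate(new_vocab):
        new_emb[j] = old_emb[index[tok]] if tok in index else mean
    return new_emb

old = np.array([[1.0, 0.0], [0.0, 1.0]])
emb = expand_embeddings(old, ["en", "fr"], ["en", "fr", "sw"])
```

Reusing shared rows preserves what the old model learned, while mean initialization keeps new rows in the same region of embedding space as existing ones.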
☆ A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends
As more and more Arabic texts emerge on the Internet, extracting important
information from them becomes especially useful. As a fundamental
technology, Named entity recognition (NER) serves as the core component in
information extraction technology, while also playing a critical role in many
other Natural Language Processing (NLP) systems, such as question answering and
knowledge graph building. In this paper, we provide a comprehensive review of
the development of Arabic NER, especially the recent advances in deep learning
and pre-trained language models. Specifically, we first introduce the background
of Arabic NER, including the characteristics of Arabic and existing resources
for Arabic NER. Then, we systematically review the development of Arabic NER
methods. Traditional Arabic NER systems focus on feature engineering and
designing domain-specific rules. In recent years, deep learning methods have
achieved significant progress by representing texts via continuous vector
representations. With the growth of pre-trained language models, Arabic NER has
yielded better performance. Finally, we summarize the methodological gap between
Arabic NER and NER in other languages, which helps outline future directions
for Arabic NER.
comment: arXiv admin note: text overlap with arXiv:2210.09263 by other authors
☆ Cluster-Level Contrastive Learning for Emotion Recognition in Conversations
A key challenge for Emotion Recognition in Conversations (ERC) is to
distinguish semantically similar emotions. Some works utilise Supervised
Contrastive Learning (SCL) which uses categorical emotion labels as supervision
signals and contrasts in high-dimensional semantic space. However, categorical
labels fail to provide quantitative information between emotions. ERC is also
not equally dependent on all embedded features in the semantic space, which
makes the high-dimensional SCL inefficient. To address these issues, we propose
a novel low-dimensional Supervised Cluster-level Contrastive Learning (SCCL)
method, which first reduces the high-dimensional SCL space to a
three-dimensional affect representation space Valence-Arousal-Dominance (VAD),
then performs cluster-level contrastive learning to incorporate measurable
emotion prototypes. To help model the dialogue and enrich the context,
we leverage pre-trained knowledge adapters to infuse linguistic and factual
knowledge. Experiments show that our method achieves new state-of-the-art
results with 69.81% on IEMOCAP, 65.7% on MELD, and 62.51% on DailyDialog
datasets. The analysis also shows that the VAD space is not only suitable for
ERC but also interpretable, with VAD prototypes enhancing its performance and
stabilising the training of SCCL. In addition, the pre-trained knowledge
adapters benefit the performance of the utterance encoder and SCCL. Our code is
available at: https://github.com/SteveKGYang/SCCL
comment: Accepted by IEEE Transactions on Affective Computing
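To illustrate why a three-dimensional Valence-Arousal-Dominance space makes emotions measurably comparable (unlike categorical labels), here is a sketch with hypothetical prototype coordinates; the paper derives measurable prototypes from affect resources, which will differ from these made-up values:

```python
import numpy as np

# Hypothetical VAD (valence, arousal, dominance) prototypes on a [0, 1]
# scale -- illustrative values only.
PROTOTYPES = {
    "happy": np.array([0.9, 0.6, 0.7]),
    "sad":   np.array([0.2, 0.3, 0.3]),
    "angry": np.array([0.2, 0.8, 0.6]),
}

def nearest_emotion(vad: np.ndarray) -> str:
    """Assign an utterance's 3-d VAD representation to the closest
    prototype; distances in VAD space quantify how similar two emotions
    are, which categorical labels cannot express."""
    return min(PROTOTYPES,
               key=lambda e: float(np.linalg.norm(vad - PROTOTYPES[e])))

pred = nearest_emotion(np.array([0.85, 0.55, 0.65]))
```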
☆ Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models
Learned metrics such as BLEURT have in recent years become widely employed to
evaluate the quality of machine translation systems. Training such metrics
requires data which can be expensive and difficult to acquire, particularly for
lower-resource languages. We show how knowledge can be distilled from Large
Language Models (LLMs) to improve upon such learned metrics without requiring
human annotators, by creating synthetic datasets which can be mixed into
existing datasets, requiring only a corpus of text in the target language. We
show that the performance of a BLEURT-like model on lower resource languages
can be improved in this way.
☆ Natural Language Processing for Policymaking
Language is the medium for many political activities, from campaigns to news
reports. Natural language processing (NLP) uses computational tools to parse
text into key information that is needed for policymaking. In this chapter, we
introduce common methods of NLP, including text classification, topic modeling,
event extraction, and text scaling. We then overview how these methods can be
used for policymaking through four major applications including data collection
for evidence-based policymaking, interpretation of political decisions, policy
communication, and investigation of policy effects. Finally, we highlight some
potential limitations and ethical concerns when using NLP for policymaking.
This text is from Chapter 7 (pages 141-162) of the Handbook of Computational
Social Science for Policy (2023). Open Access on Springer:
https://doi.org/10.1007/978-3-031-16624-2
☆ Entity-Aware Dual Co-Attention Network for Fake News Detection EACL 2023
Fake news and misinformation spread rapidly on the Internet. How to identify
them and how to interpret the identification results have become important
issues. In this paper, we propose a Dual Co-Attention Network (Dual-CAN) for
fake news detection, which takes news content, social media replies, and
external knowledge into consideration. Our experimental results support that
the proposed Dual-CAN outperforms current representative models in two
benchmark datasets. We further provide in-depth discussions by comparing how
models work on both datasets with empirical analysis of attention weights.
comment: EACL 2023 Findings
☆ What do Language Models know about word senses? Zero-Shot WSD with Language Models and Domain Inventories
Language Models are the core for almost any Natural Language Processing
system nowadays. One of their particularities is their contextualized
representations, a game-changing feature when disambiguation between word
senses is necessary. In this paper we aim to explore to what extent language
models are capable of discerning among senses at inference time. We performed
this analysis by prompting commonly used Language Models such as BERT or
RoBERTa to perform the task of Word Sense Disambiguation (WSD). We leverage the
relation between word senses and domains, and cast WSD as a textual entailment
problem, where the different hypotheses refer to the domains of the word
senses. Our results show that this approach is indeed effective, with
performance close to that of supervised systems.
comment: Presented at GWC2023
☆ The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study WWW 2023
Due to the exponential growth of scientific publications on the Web, there is
a pressing need to tag each paper with fine-grained topics so that researchers
can track their interested fields of study rather than drowning in the whole
literature. Scientific literature tagging is beyond a pure multi-label text
classification task because papers on the Web are prevalently accompanied by
metadata information such as venues, authors, and references, which may serve
as additional signals to infer relevant tags. Although there have been studies
making use of metadata in academic paper classification, their focus is often
restricted to one or two scientific fields (e.g., computer science and
biomedicine) and to one specific model. In this work, we systematically study
the effect of metadata on scientific literature tagging across 19 fields. We
select three representative multi-label classifiers (i.e., a bag-of-words
model, a sequence-based model, and a pre-trained language model) and explore
their performance change in scientific literature tagging when metadata are fed
to the classifiers as additional features. We observe some ubiquitous patterns
of metadata's effects across all fields (e.g., venues are consistently
beneficial to paper tagging in almost all cases), as well as some unique
patterns in fields other than computer science and biomedicine, which are not
explored in previous studies.
comment: 11 pages; Accepted to WWW 2023
☆ Learning Manner of Execution from Partial Corrections
Some actions must be executed in different ways depending on the context. For
example, wiping away marker requires vigorous force while wiping away almonds
requires more gentle force. In this paper we provide a model where an agent
learns which manner of action execution to use in which context, drawing on
evidence from trial and error and verbal corrections when it makes a mistake
(e.g., ``no, gently''). The learner starts out with a domain model that lacks
the concepts denoted by the words in the teacher's feedback; both the words
describing the context (e.g., marker) and the adverbs like ``gently''. We show
that through the semantics of coherence, our agent can perform the symbol
grounding that's necessary for exploiting the teacher's feedback so as to solve
its domain-level planning problem: to perform its actions in the current
context in the right way.
☆ AutoWS: Automated Weak Supervision Framework for Text Classification
Creating large, high-quality labeled datasets has become one of the major
bottlenecks for developing machine learning applications. Multiple techniques
have been developed either to decrease the dependence on labeled data
(zero/few-shot learning, weak supervision) or to improve the efficiency of the
labeling process (active learning). Among these, weak supervision has been
shown to reduce labeling costs by employing hand-crafted labeling functions
designed by domain experts. We propose AutoWS -- a novel framework for
increasing the efficiency of the weak supervision process while decreasing the
dependency on domain experts. Our method requires a small set of labeled
examples per label class and automatically creates a set of labeling functions
to assign noisy labels to numerous unlabeled data. Noisy labels can then be
aggregated into probabilistic labels used by a downstream discriminative
classifier. Our framework is fully automatic and requires no hyper-parameter
specification by users. We compare our approach with different state-of-the-art
work on weak supervision and noisy training. Experimental results show that our
method outperforms competitive baselines.
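The step of aggregating the labeling functions' noisy votes into probabilistic labels can be sketched with a simple normalized vote count; real weak-supervision frameworks learn a label model instead, so this stand-in is an assumption for illustration:

```python
from collections import Counter
from typing import Dict, List, Optional

def aggregate_labels(votes: List[Optional[str]],
                     classes: List[str]) -> Dict[str, float]:
    """Turn the noisy votes of several labeling functions into a
    probabilistic label via normalized vote counts. Abstaining functions
    vote None and are ignored; if all abstain, fall back to uniform."""
    counts = Counter(v for v in votes if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {c: 1.0 / len(classes) for c in classes}
    return {c: counts[c] / total for c in classes}

# Three labeling functions vote on one example; one abstains.
probs = aggregate_labels(["spam", "spam", None, "ham"], ["spam", "ham"])
```

The resulting soft label distribution is what a downstream discriminative classifier would be trained on.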
☆ PLACES: Prompting Language Models for Social Conversation Synthesis EACL 2023
Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, Dilek Hakkani-Tur
Collecting high quality conversational data can be very expensive for most
applications and infeasible for others due to privacy, ethical, or similar
concerns. A promising direction to tackle this problem is to generate synthetic
dialogues by prompting large language models. In this work, we use a small set
of expert-written conversations as in-context examples to synthesize a social
conversation dataset using prompting. We perform several thorough evaluations
of our synthetic conversations compared to human-collected conversations. This
includes various dimensions of conversation quality with human evaluation
directly on the synthesized conversations, and interactive human evaluation of
chatbots fine-tuned on the synthetically generated dataset. We additionally
demonstrate that this prompting approach is generalizable to multi-party
conversations, providing potential to create new synthetic data for multi-party
tasks. Our synthetic multi-party conversations were rated more favorably across
all measured dimensions compared to conversation excerpts sampled from a
human-collected multi-party dataset.
comment: In EACL 2023. 25 pages, 4 figures, 26 tables. Link to code
forthcoming
☆ Continual Learning of Language Models ICLR 2023
Language models (LMs) have been instrumental for the rapid advance of natural
language processing. This paper studies continual learning of LMs, in
particular, continual domain-adaptive pre-training (or continual DAP-training).
Existing research has shown that further pre-training an LM using a domain
corpus to adapt the LM to the domain can improve the end-task performance in
the domain. This paper proposes a novel method to continually DAP-train an LM
with a sequence of unlabeled domain corpora to adapt the LM to these domains to
improve their end-task performances. The key novelty of our method is a
soft-masking mechanism that directly controls the update to the LM. A novel
proxy is also proposed to preserve the general knowledge in the original LM.
Additionally, it contrasts the representations of the previously learned domain
knowledge (including the general knowledge in the pre-trained LM) and the
knowledge from the current full network to achieve knowledge integration. The
method not only overcomes catastrophic forgetting, but also achieves knowledge
transfer to improve end-task performances. Empirical evaluation demonstrates
the effectiveness of the proposed method.
comment: ICLR 2023
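The soft-masking mechanism that "directly controls the update to the LM" can be sketched as importance-scaled gradient updates. The importance scores are assumed given here, whereas the paper computes them with its proposed proxy:

```python
import numpy as np

def soft_masked_update(params: np.ndarray, grads: np.ndarray,
                       importance: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Sketch of a soft-masking step: scale each parameter's gradient by
    (1 - importance), so parameters important to previously learned
    domains (importance near 1) are updated less, mitigating forgetting."""
    return params - lr * (1.0 - importance) * grads

p = np.array([1.0, 1.0])
g = np.array([1.0, 1.0])
imp = np.array([0.0, 1.0])   # second parameter fully protected
new_p = soft_masked_update(p, g, imp)
```

Unlike a hard binary mask, the continuous scaling lets partially important parameters still adapt to the new domain.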
☆ Bringing the State-of-the-Art to Customers: A Neural Agent Assistant Framework for Customer Service Support EMNLP 2022
Stephen Obadinma, Faiza Khan Khattak, Shirley Wang, Tania Sidhom, Elaine Lau, Sean Robertson, Jingcheng Niu, Winnie Au, Alif Munim, Karthik Raja K. Bhaskar, Bencheng Wei, Iris Ren, Waqar Muhammad, Erin Li, Bukola Ishola, Michael Wang, Griffin Tanner, Yu-Jia Shiah, Sean X. Zhang, Kwesi P. Apponsah, Kanishk Patel, Jaswinder Narain, Deval Pandya, Xiaodan Zhu, Frank Rudzicz, Elham Dolatabadi
Building Agent Assistants that can help improve customer service support
requires inputs from industry users and their customers, as well as knowledge
about state-of-the-art Natural Language Processing (NLP) technology. We combine
expertise from academia and industry to bridge the gap and build
task/domain-specific Neural Agent Assistants (NAA) with three high-level
components for: (1) Intent Identification, (2) Context Retrieval, and (3)
Response Generation. In this paper, we outline the pipeline of the NAA's core
system and also present three case studies in which three industry partners
successfully adapt the framework to find solutions to their unique challenges.
Our findings suggest that a collaborative process is instrumental in spurring
the development of emerging NLP models for Conversational AI tasks in industry.
The full reference implementation code and results are available at
\url{https://github.com/VectorInstitute/NAA}
comment: Camera Ready Version of Paper Published in EMNLP 2022 Industry Track
☆ An entity-guided text summarization framework with relational heterogeneous graph neural network
Two crucial issues for text summarization to generate faithful summaries are
to make use of knowledge beyond text and to make use of cross-sentence
relations in text. Intuitive ways for the two issues are Knowledge Graph (KG)
and Graph Neural Network (GNN) respectively. Entities are semantic units in
text and in KG. This paper focuses on both issues by leveraging entities
mentioned in text to connect GNN and KG for summarization. Firstly, entities
are leveraged to construct a sentence-entity graph with weighted multi-type
edges to model sentence relations, and a relational heterogeneous GNN for
summarization is proposed to calculate node encodings. Secondly, entities are
leveraged to link the graph to KG to collect knowledge. Thirdly, entities guide
a two-step summarization framework defining a multi-task selector to select
salient sentences and entities, and using an entity-focused abstractor to
compress the sentences. GNN is connected with KG by constructing
sentence-entity graphs where entity-entity edges are built based on KG,
initializing entity embeddings on KG, and training entity embeddings using
entity-entity edges. The relational heterogeneous GNN utilizes both edge
weights and edge types in GNN to calculate graphs with weighted multi-type
edges. Experiments show the proposed method outperforms extractive baselines
including the HGNN-based HGNNSum and abstractive baselines including the
entity-driven SENECA on CNN/DM, and outperforms most baselines on NYT50.
Experiments on sub-datasets show the density of sentence-entity edges greatly
influences the performance of the proposed method. The greater the density, the
better the performance. Ablation studies show the effectiveness of the method.
comment: 7 tables, 5 figures
☆ Exploring the Benefits of Training Expert Language Models over Instruction Tuning
Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo
Recently, Language Models (LMs) instruction-tuned on multiple tasks, also
known as multitask-prompted fine-tuning (MT), have shown the capability to
generalize to unseen tasks. Previous work has shown that scaling the number of
training tasks is the key component in making stronger MT LMs. In this work, we
report an unexpected finding that an expert LM fine-tuned on just a single task
can outperform an MT LM trained with 300+ different tasks on 11 different
unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean
accuracy of 3.20% and 1.29%, respectively. This finding casts doubt on the
previously held belief that simply scaling the number of tasks makes stronger
MT LMs. Leveraging this finding, we further show that this distributed approach
of training a separate expert LM per training task instead of a single MT LM
for zero-shot inference possesses many benefits including (1) avoiding negative
task transfer that often occurs during instruction tuning, (2) being able to
continually learn new tasks without having to re-train on previous tasks to
avoid catastrophic forgetting, and (3) showing compositional capabilities when
merging individual experts together. The code is available at
https://github.com/joeljang/ELM.
☆ UDApter -- Efficient Domain Adaptation Using Adapters
We propose two methods to make unsupervised domain adaptation (UDA) more
parameter efficient using adapters, small bottleneck layers interspersed with
every layer of the large-scale pre-trained language model (PLM). The first
method deconstructs UDA into a two-step process: first by adding a domain
adapter to learn domain-invariant information and then by adding a task adapter
that uses domain-invariant information to learn task representations in the
source domain. The second method jointly learns a supervised classifier while
reducing the divergence measure. Compared to strong baselines, our simple
methods perform well in natural language inference (MNLI) and the cross-domain
sentiment classification task. We even outperform unsupervised domain
adaptation methods such as DANN and DSN in sentiment classification, and we are
within 0.85% F1 for the natural language inference task, by fine-tuning only a
fraction of the full model parameters. We release our code at
https://github.com/declare-lab/UDAPTER
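A generic bottleneck adapter of the kind inserted between PLM layers can be sketched as down-projection, nonlinearity, up-projection, and a residual connection. The zero-initialized up-projection (making the adapter start as an identity map) is a common convention, not necessarily the authors' exact setup:

```python
import numpy as np

class Adapter:
    """Bottleneck adapter sketch: down-project, ReLU, up-project, residual.
    Only these small matrices would be trained, leaving the PLM frozen."""
    def __init__(self, hidden: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(scale=0.02, size=(hidden, bottleneck))
        self.up = np.zeros((bottleneck, hidden))  # identity at init

    def __call__(self, h: np.ndarray) -> np.ndarray:
        z = np.maximum(h @ self.down, 0.0)        # ReLU bottleneck
        return h + z @ self.up                    # residual connection

h = np.ones((2, 16))
out = Adapter(hidden=16, bottleneck=4)(h)
```

With a bottleneck much smaller than the hidden size, the adapter adds only a tiny fraction of the model's parameters, which is what makes this form of domain adaptation parameter efficient.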
☆ Capturing Topic Framing via Masked Language Modeling EMNLP 2022
Differential framing of issues can lead to divergent world views on important
issues. This is especially true in domains where the information presented can
reach a large audience, such as traditional and social media. Scalable and
reliable measurement of such differential framing is an important first step in
addressing them. In this work, based on the intuition that framing affects the
tone and word choices in written language, we propose a framework for modeling
the differential framing of issues through masked token prediction via
large-scale fine-tuned language models (LMs). Specifically, we explore three
key factors for our framework: 1) prompt generation methods for the masked
token prediction; 2) methods for normalizing the output of fine-tuned LMs; 3)
robustness to the choice of pre-trained LMs used for fine-tuning. Through
experiments on a dataset of articles from traditional media outlets covering
five diverse and politically polarized topics, we show that our framework can
capture differential framing of these topics with high reliability.
comment: In Findings of EMNLP 2022
♻ ☆ A Non-monotonic Self-terminating Language Model ICLR 2023
Recent large-scale neural autoregressive sequence models have shown
impressive performances on a variety of natural language generation tasks.
However, their generated sequences often exhibit degenerate properties such as
non-termination, undesirable repetition, and premature termination, when
generated with decoding algorithms such as greedy search, beam search, top-$k$
sampling, and nucleus sampling. In this paper, we focus on the problem of
non-terminating sequences resulting from an incomplete decoding algorithm. We
first define an incomplete probable decoding algorithm which includes greedy
search, top-$k$ sampling, and nucleus sampling, beyond the incomplete decoding
algorithm originally put forward by Welleck et al. (2020). We then propose a
non-monotonic self-terminating language model, which significantly relaxes the
constraint of monotonically increasing termination probability in the
originally proposed self-terminating language model by Welleck et al. (2020),
to address the issue of non-terminating sequences when using incomplete
probable decoding algorithms. We prove that our proposed model prevents
non-terminating sequences when using not only incomplete probable decoding
algorithms but also beam search. We empirically validate our model on sequence
completion tasks with various architectures.
comment: Published as a conference paper at ICLR 2023
♻ ☆ N-Gram Nearest Neighbor Machine Translation
Nearest neighbor machine translation augments the Autoregressive
Translation~(AT) with $k$-nearest-neighbor retrieval, by comparing the
similarity between the token-level context representations of the target tokens
in the query and the datastore. However, the token-level representation may
introduce noise when translating ambiguous words, or fail to provide accurate
retrieval results when the representation generated by the model contains
indistinguishable context information, e.g., Non-Autoregressive
Translation~(NAT) models. In this paper, we propose a novel $n$-gram nearest
neighbor retrieval method that is model agnostic and applicable to both AT and
NAT models. Specifically, we concatenate the adjacent $n$-gram hidden
representations as the key, while the tuple of corresponding target tokens is
the value. In inference, we propose tailored decoding algorithms for AT and NAT
models respectively. We demonstrate that the proposed method consistently
outperforms the token-level method on both AT and NAT models, on general as
well as on domain adaptation translation tasks. On domain adaptation, the
proposed method brings average BLEU improvements of $1.03$ and $2.76$ on AT and
NAT models respectively.
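The datastore construction described above (concatenated adjacent hidden states as the key, the tuple of corresponding target tokens as the value) can be sketched as follows, with toy two-dimensional hidden states and k=1 retrieval for brevity:

```python
import numpy as np

def build_ngram_datastore(hiddens, targets, n=2):
    """Build an n-gram datastore: each key concatenates n adjacent hidden
    states; each value is the tuple of the n corresponding target tokens."""
    keys, values = [], []
    for i in range(len(targets) - n + 1):
        keys.append(np.concatenate(hiddens[i:i + n]))
        values.append(tuple(targets[i:i + n]))
    return np.stack(keys), values

def retrieve(query_key, keys, values):
    """Return the value whose key is nearest to the query (k=1)."""
    d = ((keys - query_key) ** 2).sum(axis=1)
    return values[int(np.argmin(d))]

# Toy hidden states for the target sequence ["a", "b", "c"].
hiddens = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
keys, values = build_ngram_datastore(hiddens, ["a", "b", "c"], n=2)
hit = retrieve(np.array([0.0, 1.0, 1.0, 0.0]), keys, values)
```

Because keys span n positions, a retrieval returns n tokens at once, which is what the tailored AT and NAT decoding algorithms then have to consume.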
♻ ☆ FADO: Feedback-Aware Double COntrolling Network for Emotional Support Conversation SC
Emotional Support Conversation (ESConv) aims to reduce help-seekers' emotional
distress with the supportive strategy and response. It is essential for the
supporter to select an appropriate strategy with the feedback of the
help-seeker (e.g., emotion change during dialog turns) in ESConv. However,
previous methods mainly focus on the dialog history to select the strategy and
ignore the help-seeker's feedback, leading to wrong and user-irrelevant
strategy predictions. In addition, these approaches only model the
context-to-strategy flow and pay less attention to the strategy-to-context flow
that can focus on the strategy-related context for generating the
strategy-constrained response. In this paper, we propose a Feedback-Aware Double
COntrolling Network (FADO) to make a strategy schedule and generate the
supportive response. The core module in FADO consists of a dual-level feedback
strategy selector and a double control reader. Specifically, the dual-level
feedback strategy selector leverages the turn-level and conversation-level
feedback to encourage or penalize strategies. The double control reader
constructs the novel strategy-to-context flow for generating the
strategy-constrained response. Furthermore, a strategy dictionary is designed to
enrich the semantic information of the strategy and improve the quality of
strategy-constrained responses. Experimental results on ESConv show that the
proposed FADO achieves state-of-the-art performance in terms of both
strategy selection and response generation. Our code is available at
https://github.com/Thedatababbler/FADO.
comment: Accepted on Knowl. Based Syst. (SCI I)
♻ ☆ Recent Advances in Neural Text Generation: A Task-Agnostic Survey
In recent years much effort has been devoted to applying neural models to the
task of natural language generation. The challenge is to generate natural
human-like text, and to control the generation process. This paper presents a
task-agnostic survey of recent advances in neural text generation. These
advances have been achieved by numerous developments, which we group under the
following four headings: data construction, neural frameworks, training and
inference strategies, and evaluation metrics. Finally, we discuss future
directions for the development of neural text generation, including neural
pipelines and exploiting background knowledge.
♻ ☆ Scaling Back-Translation with Domain Text Generation for Sign Language Gloss Translation EACL 2023
Sign language gloss translation aims to translate the sign glosses into
spoken language texts, which is challenging due to the scarcity of labeled
gloss-text parallel data. Back translation (BT), which generates
pseudo-parallel data by translating in-domain spoken language texts into sign
glosses, has been applied to alleviate the data scarcity problem. However, the
lack of large-scale high-quality domain spoken language text data limits the
effect of BT. In this paper, to overcome this limitation, we propose a
Prompt-based domain text Generation (PGEN) approach to produce large-scale
in-domain spoken language text data. Specifically, PGEN randomly concatenates
sentences from the original in-domain spoken language text data as prompts to
induce a pre-trained language model (i.e., GPT-2) to generate spoken language
texts in a similar style. Experimental results on three benchmarks of sign
language gloss translation in varied languages demonstrate that BT with spoken
language texts generated by PGEN significantly outperforms the compared
methods. In addition, as the scale of spoken language texts generated by PGEN
increases, the BT technique can achieve further improvements, demonstrating the
effectiveness of our approach. We release the code and data for facilitating
future research in this field.
comment: Accepted at EACL 2023 (main conference)
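The prompt-construction step that the PGEN abstract describes (randomly concatenating in-domain sentences as a prompt for a pre-trained LM) can be sketched as follows. This is an illustrative reconstruction, not the released code; the corpus is a made-up example, and the commented-out `generate()` call is a hypothetical placeholder for the actual GPT-2 invocation.

```python
import random

# Toy sketch of PGEN-style prompting: sample a few in-domain sentences,
# concatenate them, and use the result to prompt a pre-trained LM so that
# it continues in the same style.

def build_prompt(corpus, k=3, rng=None):
    """Randomly pick k distinct sentences and join them into one prompt."""
    rng = rng or random.Random(0)
    return " ".join(rng.sample(corpus, k))

corpus = [
    "Tomorrow will be mostly sunny in the north.",
    "Rain is expected along the coast in the evening.",
    "Temperatures will drop below freezing overnight.",
    "Strong winds may reach gale force in the mountains.",
]
prompt = build_prompt(corpus, k=2)
# generated = language_model.generate(prompt)  # hypothetical GPT-2 call
print(prompt)
```

The generated continuations, not the prompts themselves, would then serve as the monolingual data for back-translation.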
♻ ☆ Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend
Word-level textual adversarial attacks have achieved striking performance in
fooling natural language processing models. However, the fundamental questions
of why these attacks are effective, and the intrinsic properties of the
adversarial examples (AEs), are still not well understood. This work attempts
to interpret textual attacks through the lens of $n$-gram frequency.
Specifically, it is revealed that existing word-level attacks exhibit a strong
tendency to generate examples with $n$-gram frequency descend ($n$-FD).
Intuitively, this finding suggests a natural way to improve model
robustness by training the model on the $n$-FD examples. To verify this idea,
we devise a model-agnostic and gradient-free AE generation approach that relies
solely on the $n$-gram frequency information, and further integrate it into the
recently proposed convex hull framework for adversarial training. Surprisingly,
the resultant method performs quite similarly to the original gradient-based
method in terms of model robustness. These findings provide a
human-understandable perspective for interpreting word-level textual
adversarial attacks, and a new direction to improve model robustness.
comment: 8 pages, 4 figures. In progress
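The $n$-FD tendency described above can be made concrete with a minimal sketch: among synonym candidates, pick the substitution that most lowers the example's total $n$-gram frequency. The frequency table, unigram setting ($n=1$), and candidate lists below are toy assumptions, not the paper's setup.

```python
from collections import Counter

# Toy sketch of an n-gram frequency-descend (n-FD) substitution, using
# unigram counts from a hypothetical reference corpus for simplicity.

freq = Counter({"good": 100, "fine": 40, "decent": 5, "movie": 80, "a": 500})

def nfd_substitute(tokens, position, candidates):
    """Replace tokens[position] with the candidate minimizing total frequency."""
    def total_freq(toks):
        return sum(freq.get(t, 0) for t in toks)
    best = min(candidates, key=lambda c: total_freq(
        tokens[:position] + [c] + tokens[position + 1:]))
    return tokens[:position] + [best] + tokens[position + 1:]

print(nfd_substitute(["a", "good", "movie"], 1, ["fine", "decent"]))
# ['a', 'decent', 'movie']
```

The paper's point is the converse use: training on such low-frequency examples can improve robustness about as much as gradient-based adversarial training.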
♻ ☆ Prompting Neural Machine Translation with Translation Memories AAAI 2023
Improving machine translation (MT) systems with translation memories (TMs) is
of great interest to practitioners in the MT community. However, previous
approaches require either a significant update of the model architecture and/or
additional training efforts to make the models well-behaved when TMs are taken
as additional input. In this paper, we present a simple but effective method to
introduce TMs into neural machine translation (NMT) systems. Specifically, we
treat TMs as prompts to the NMT model at test time, but leave the training
process unchanged. The result is a slight update of an existing NMT system,
which can be implemented in a few hours by anyone who is familiar with NMT.
Experimental results on several datasets demonstrate that our system
significantly outperforms strong baselines.
comment: Accepted to AAAI 2023
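The test-time recipe in the abstract above (retrieve a similar translation-memory entry and feed it to an unchanged NMT model as a prompt) can be sketched as follows. The retrieval metric, the `|||` prompt format, and the tiny TM are illustrative assumptions; the real system would pass the prompt to an existing NMT decoder.

```python
import difflib

# Toy sketch of TM-as-prompt: find the most similar TM source sentence and
# prepend the (source, target) pair to the input, leaving training untouched.

tm = [
    ("the cat sat on the mat", "le chat était assis sur le tapis"),
    ("the weather is nice today", "il fait beau aujourd'hui"),
]

def tm_prompt(source):
    """Build a prompt from the closest TM entry plus the new source sentence."""
    src, tgt = max(tm, key=lambda entry: difflib.SequenceMatcher(
        None, source, entry[0]).ratio())
    # A hypothetical NMT system would then decode this prompted input.
    return f"{src} ||| {tgt} ||| {source}"

print(tm_prompt("the dog sat on the mat"))
```

The appeal of the method is exactly this lightness: only the inference-time input changes, so an existing NMT system needs no retraining.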
♻ ☆ A Comprehensive Comparison of Pre-training Language Models
Recently, the development of pre-trained language models has brought natural
language processing (NLP) tasks to a new state of the art. In this paper, we
explore the efficiency of various pre-trained language models. We pre-train a
list of transformer-based models with the same amount of text and the same
number of training steps. The experimental results show that the largest
improvement over the original BERT comes from adding an RNN layer to capture
more contextual information for short-text understanding. However, we conclude
that there is no remarkable improvement in short-text understanding across
similar BERT structures, whereas a data-centric method [12] can achieve better
performance.
♻ ☆ Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions EACL 2023
In recent years, progress in NLU has been driven by benchmarks. These
benchmarks are typically collected by crowdsourcing, where annotators write
examples based on annotation instructions crafted by dataset creators. In this
work, we hypothesize that annotators pick up on patterns in the crowdsourcing
instructions, which bias them to write many similar examples that are then
over-represented in the collected data. We study this form of bias, termed
instruction bias, in 14 recent NLU benchmarks, showing that instruction
examples often exhibit concrete patterns, which are propagated by crowdworkers
to the collected data. This extends previous work (Geva et al., 2019) and
raises a new concern: whether we are modeling the dataset creator's
instructions rather than the task. Through a series of experiments, we show
that, indeed, instruction bias can lead to overestimation of model performance,
and that models struggle to generalize beyond biases originating in the
crowdsourcing instructions. We further analyze the influence of instruction
bias in terms of pattern frequency and model size, and derive concrete
recommendations for creating future NLU benchmarks.
comment: Accepted to EACL 2023
♻ ☆ MetaQA: Combining Expert Agents for Multi-Skill Question Answering EACL 2023
The recent explosion of question answering (QA) datasets and models has
increased the interest in the generalization of models across multiple domains
and formats by either training on multiple datasets or by combining multiple
models. Despite the promising results of multi-dataset models, some domains or
QA formats may require specific architectures, and thus the adaptability of
these models might be limited. In addition, current approaches for combining
models disregard cues such as question-answer compatibility. In this work, we
propose to combine expert agents with a novel, flexible, and training-efficient
architecture that considers questions, answer predictions, and
answer-prediction confidence scores to select the best answer among a list of
answer candidates. Through quantitative and qualitative experiments we show
that our model i) creates a collaboration between agents that outperforms
previous multi-agent and multi-dataset approaches in both in-domain and
out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be
adapted to any QA format. We release our code and a dataset of answer
predictions from expert agents for 16 QA datasets to foster future developments
of multi-agent systems on https://github.com/UKPLab/MetaQA.
comment: Accepted at EACL 2023
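The selection step in the MetaQA abstract (combining each expert agent's answer, its confidence, and question-answer compatibility) can be illustrated with a toy heuristic. MetaQA itself trains this selector; here token overlap stands in for the learned compatibility signal, and the `weight` parameter is a made-up mixing coefficient.

```python
# Toy sketch (not the trained MetaQA selector): score each expert agent's
# candidate answer by mixing its confidence with a crude question-answer
# compatibility signal, then return the best-scoring answer.

def select_answer(question, candidates, weight=0.5):
    """candidates: list of (answer, confidence) pairs from expert agents."""
    q_tokens = set(question.lower().split())
    def score(pair):
        answer, conf = pair
        overlap = len(q_tokens & set(answer.lower().split())) / max(
            len(answer.split()), 1)
        return weight * conf + (1 - weight) * overlap
    return max(candidates, key=score)[0]

cands = [("Paris", 0.9), ("the capital of France is Paris", 0.4)]
print(select_answer("What is the capital of France?", cands))
```

The point of the architecture is that this selector is cheap to train on top of frozen expert agents, so new QA formats only require adding an agent.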
♻ ☆ MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
While there has been a recent burgeoning of applications at the intersection
of natural and programming languages, such as code generation and code
summarization, these applications are usually English-centric. This creates a
barrier for program developers who are not proficient in English. To mitigate
this gap in technology development across languages, we propose a multilingual
dataset, MCoNaLa, to benchmark code generation from natural language commands
extending beyond English. Modeled on the methodology of the English
Code/Natural Language Challenge (CoNaLa) dataset, we annotated a total of 896
NL-code pairs in three languages: Spanish, Japanese, and Russian. We present a
quantitative evaluation of performance on the MCoNaLa dataset by testing with
state-of-the-art code generation systems. While the difficulties vary across
these three languages, all systems lag significantly behind their English
counterparts, revealing the challenges in adapting code generation to new
languages.
♻ ☆ ProKD: An Unsupervised Prototypical Knowledge Distillation Network for Zero-Resource Cross-Lingual Named Entity Recognition AAAI 2023
For named entity recognition (NER) in zero-resource languages, utilizing
knowledge distillation to transfer language-independent knowledge from
rich-resource source languages to zero-resource languages is an effective
approach. Typically, these approaches adopt a teacher-student architecture, where
the teacher network is trained in the source language, and the student network
seeks to learn knowledge from the teacher network and is expected to perform
well in the target language. Despite the impressive performance achieved by
these methods, we argue that they have two limitations. First, the teacher
network fails to effectively learn language-independent knowledge shared
across languages due to differences in the feature distribution between the
source and target languages. Second, the student network acquires all of its
knowledge from the teacher network and ignores the learning of target
language-specific knowledge. These limitations hinder the model's performance
in the target language. This paper proposes an unsupervised
prototype knowledge distillation network (ProKD) to address these issues.
Specifically, ProKD presents a contrastive learning-based prototype alignment
method to achieve class feature alignment by adjusting the distance among
prototypes in the source and target languages, boosting the teacher network's
capacity to acquire language-independent knowledge. In addition, ProKD
introduces a prototypical self-training method to learn the intrinsic structure
of the language by retraining the student network on the target data using
samples' distance information from prototypes, thereby enhancing the student
network's ability to acquire language-specific knowledge. Extensive experiments
on three benchmark cross-lingual NER datasets demonstrate the effectiveness of
our approach.
comment: AAAI 2023
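The prototypical self-training step in the ProKD abstract can be sketched in miniature: compute a prototype (mean feature vector) per class, then pseudo-label target-language samples by their nearest prototype. The two-dimensional feature vectors below are hand-made toy values, not real NER encodings, and the distance-based labeling is a simplification of the paper's distance-weighted retraining.

```python
import math

# Toy sketch of prototype-based pseudo-labeling for self-training.

def mean_vec(vectors):
    """Class prototype: element-wise mean of the class's feature vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pseudo_label(sample, prototypes):
    """Assign the class whose prototype is nearest to the sample."""
    return min(prototypes, key=lambda c: euclidean(sample, prototypes[c]))

features = {"PER": [[1.0, 0.1], [0.9, 0.2]], "LOC": [[0.1, 1.0], [0.2, 0.8]]}
prototypes = {cls: mean_vec(vecs) for cls, vecs in features.items()}
print(pseudo_label([0.95, 0.15], prototypes))  # PER
```

In ProKD these pseudo-labels drive retraining of the student network on target-language data, which is how it picks up language-specific structure the teacher cannot provide.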